{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "# 基于物品的协同过滤算法<sup><a id=\"fnr.1\" class=\"footref\" href=\"#fn.1\">1</a></sup>\n",
    "\n",
    "**目前业界应用最多的算法。**\n",
    "\n",
    "**给用户推荐和他们之前喜欢的物品相似的物品。**\n",
    "\n",
    "**其主要通过分析用户的行为记录计算物品之间的相似度。物品A和物品B具有很大的相似度是因为喜欢物品A的用户大都也喜欢物品B。**\n",
    "\n",
    "**ItemCF** 可利用用户的历史行为给推荐结果提供推荐解释。\n",
    "\n",
    "ItemCF算法主要分为两步：\n",
    "\n",
    "1.  计算物品之间的相似度；\n",
    "2.  根据物品的相似度和用户的历史行为给用户生成推荐列表。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "从“Customers Who Bought This Item Also Bought”出发，用下面的公式定义物品的相似度：\n",
    "\\begin{equation}\n",
    "w_{ij} = \\frac{|N(i) \\cap N(j)|}{|N(i)|} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "分母 $|N(i)|$ 是喜欢物品 $i$ 的用户数，分子 $|N(i) \\cap N(j)|$ 是同时喜欢物品 $i$ 和物品 $j$ 的用户数。可理解为 **喜欢物品 $i$ 的用户中有多少比例的用户也喜欢物品 $j$** 。\n",
    "\n",
    "为避免推荐出现热门的物品，用下面的公式：\n",
    "\n",
    "\\begin{equation}\n",
    "w_{ij} = \\frac{|N(i) \\cap N(j)|}{\\sqrt{|N(i)||N(j)|}} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "此公式惩罚了物品 $j$ 的权重，减轻了热门物品会和很多物品相似的可能性。\n",
    "\n",
    "两个物品产生相似度是因为它们共同被很多用户喜欢，即每个用户都可以通过他们的历史兴趣列表给物品“贡献”相似度。其中蕴涵一个假设，即 **每个用户的兴趣都局限在某几个方面** ，如果两个物品属于一个用户的兴趣列表，那么这两个物品可能就属于有限的几个领域，而如果两个物品属于很多用户的兴趣列表，那么它们就可能属于同一个领域，因而有很大的相似度。\n",
    "\n",
    "用 **ItemCF** 算法计算物品相似度时也可以首先建立用户——物品倒排表（即对每个用户建立一个包含他喜欢的物品的列表），然后对每个用户，将他物品列表中的物品两两在共现矩阵 $C$ 中 **加1** 。\n",
    "\n",
    "伪代码：\n",
    "\n",
    "```python\n",
    "def item_similarity(train):\n",
    "    import math\n",
    "\n",
    "    # calculate co-rated users between items\n",
    "    C = dict()\n",
    "    N = dict()\n",
    "    for u, items in train.items():\n",
    "        for i in items:\n",
    "            N[i] += 1\n",
    "            for j in items:\n",
    "                if i == j:\n",
    "                    continue\n",
    "                C[i][j] += 1\n",
    "\n",
    "    # calculate final similarity matrix W\n",
    "    W = dict()\n",
    "    for i, related_items in C.items():\n",
    "        for j, cij in related_items.items():\n",
    "            W[u][v] = cij / math.sqrt(N[i] * N[j])\n",
    "    return W\n",
    "```\n",
    "\n",
    "<code>C[i][j]</code>记录了同时喜欢物品 $i$ 和物品 $j$ 的用户数。将 $C$ 矩阵归一化可得到物品之间的余弦相似度矩阵 $W$ 。\n",
    "\n",
    "得到物品之间的相似度后， **ItemCF** 通过如下公式计算用户 $u$ 对一个物品 $j$ 兴趣：\n",
    "\n",
    "\\begin{equation}\n",
    "p_{uj} = \\sum_{i \\in N (u) \\cap S (j, K)} w_{ji} r_{ui} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "**和用户历史上感兴趣的物品越相似的物品，越有可能在用户的推荐列表中获得比较高的排名。**\n",
    "\n",
    "$N (u)$ 是用户喜欢的物品的集合， $S (j,K)$ 是和物品 $j$ 最相似的 $K$ 个物品的集合， $w_{ji}$ 是物品 $j$ 和 $i$ 的相似度， $r_{ui}$ 是用户 $u$ 对物品 $i$ 的兴趣。（对于隐反馈数据集，如果用户 $u$ 对物品 $i$ 有过行为，即可令 $r_{ui}=1$ 。）\n",
    "\n",
    "伪代码如下：\n",
    "\n",
    "```python\n",
    "def Recommendation(train, user_id, W, K):\n",
    "    rank = dict()\n",
    "    ru = train[user_id]\n",
    "    for i, pi in ru.items():\n",
    "        for j, wj in sorted(W[i].items(), key=itemgetter(1), reverse=True)[0:K]:\n",
    "            if j in ru:\n",
    "                continue\n",
    "            rank[j] += pi * wj\n",
    "    return rank\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "autoscroll": false,
    "collapsed": false,
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "def item_based_recommend(data, w, user):\n",
    "    \"\"\"\n",
    "    基于物品相似度为用户 user 推荐物品\n",
    "\n",
    "    Args:\n",
    "    - data: mat, 物品用户矩阵\n",
    "    - w: mat, 物品与物品之间的相似性\n",
    "    - user: int, 用户编号\n",
    "\n",
    "    :return: predict, list, 推荐列表\n",
    "    \"\"\"\n",
    "\n",
    "    m, n = np.shape(data)  # m: 物品数量 n: 用户数量\n",
    "    interaction = data[:, user].T  # 用户 user 互动物品信息\n",
    "\n",
    "    # 找到用户 user 没有互动的商品\n",
    "    not_iter = []\n",
    "    for i in range(m):\n",
    "        if interaction[0, i] == 0:  # 用户 user 未打分项\n",
    "            not_iter.append(i)\n",
    "\n",
    "    # 对没有互动过的物品进行预测\n",
    "    predict = {}\n",
    "    for x in not_iter:\n",
    "        item = np.copy(interaction)  # 获取用户 user 对物品的互动信息\n",
    "        for j in range(m):   # 对每一个物品\n",
    "            if item[0, j] != 0:  # 利用互动过的物品预测\n",
    "                if x not in predict:\n",
    "                    predict[x] = w[x, j] * item[0, j]\n",
    "                else:\n",
    "                    predict[x] = predict[x] + w[x, j] * item[0, j]\n",
    "    # 按照预测的大小从大到小排序\n",
    "    return sorted(predict.items(), key=lambda d: d[1], reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "ItemCF的一个优势就是可以提供推荐解释，即利用用户历史上喜欢的物品为现在的推荐结果进行解释。\n",
    "\n",
    "下面的伪码实现了带解释的ItemCF算法：\n",
    "\n",
    "```python\n",
    "def Recommendation(train, user_id, W, K):\n",
    "    rank = dict()\n",
    "    ru = train[user_id]\n",
    "    for i, pi in ru.items():\n",
    "        for j, wj in sorted(W[i].items(), key=itemgetter(1), reverse=True)[0:K]:\n",
    "            if j in ru:\n",
    "                continue\n",
    "            rank[j].weight += pi * wj\n",
    "            rank[j].reason[i] = pi * wj\n",
    "    return rank\n",
    "```\n",
    "\n",
    "**IUF** （Inverse User Frequence），用户活跃度对数的倒数。活跃用户对物品相似度的贡献应该小于不活跃的用户，增加IUF参数来修正物品相似度的计算公式：\n",
    "\n",
    "\\begin{equation}\n",
    "w_{ij} = \\frac{\\sum_{u \\in N(i) \\cap N(j)} \\frac{1}{\\log{1 + |N(u)|}}}{\\sqrt{|N(i)||N(j)|}} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "上面的公式只是对活跃用户做了一种软性的惩罚，但对于很多过于活跃的用户，比如上面一位买了网站上80%图书的用户，为了避免相似度矩阵过于稠密，在实际计算中一般直接忽略他的兴趣列表，而不将其纳入到相似度计算的数据集中。\n",
    "\n",
    "```python\n",
    "def item_similarity(train):\n",
    "    import math\n",
    "\n",
    "    # calculate co-rated users between items\n",
    "    C = dict()\n",
    "    N = dict()\n",
    "    for u, items in train.items():\n",
    "        for i in users:\n",
    "            N[i] += 1\n",
    "            for j in users:\n",
    "                if i == j:\n",
    "                    continue\n",
    "                C[i][j] += 1 / math.log(1 + len(items) * 1.0)\n",
    "    # calculate finial similarity matrix W\n",
    "    W = dict()\n",
    "    for i, related_items in C.items():\n",
    "        for j, cij in related_items.items():\n",
    "            W[u][v] = cij / math.sqrt(N[i] * N[j])\n",
    "    return W\n",
    "```\n",
    "\n",
    "将上面的算法记为 **ItemCF-IUF** 。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "<a id=\"org67463de\"></a>\n",
    "\n",
    "## 物品相似度的归一化\n",
    "\n",
    "Karypis研究发现如果将ItemCF的相似度矩阵按最大值归一化，可提高推荐的准确率。<sup><a id=\"fnr.2\" class=\"footref\" href=\"#fn.2\">2</a></sup>\n",
    "\n",
    "如果已经得到了物品相似度矩阵 $w$ ，那么可以用如下公式得到归一化之后的相似度矩阵 $w'$ ：\n",
    "\n",
    "\\begin{equation}\n",
    "w_{ij}^{'} = \\frac{w_{ij}}{\\underset{j}{\\max} w_{ij}} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "归一化的好处不仅仅在于增加推荐的准确度，它还可以提高推荐的覆盖率和多样性。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "<a id=\"org0d23b61\"></a>\n",
    "\n",
    "## UserCF和ItemCF比较\n",
    "\n",
    "**UserCF的推荐结果着重于反映和用户兴趣相似的小群体的热点，而ItemCF的推荐结果着重于维系用户的历史兴趣。**\n",
    "\n",
    "**UserCF的推荐更社会化，反映了用户所在的小型兴趣群体中物品的热门程度，而ItemCF的推荐更加个性化，反映了用户自己的兴趣传承。**\n",
    "\n",
    "个性化新闻推荐更加强调抓住新闻热点，热门程度和时效性是个性化新闻推荐的重点，而个性化相对于这两点略显次要。\n",
    "\n",
    "在新闻网站中，物品的更新速度远远快于新用户的加入速度，而且对于新用户，完全可以给他推荐最热门的新闻，因此UserCF利大于弊。\n",
    "\n",
    "图书、电子商务和电影网站中，用户的兴趣是比较固定和持久的。这些网站中个性化推荐的任务是帮助用户发现和他研究领域相关的物品。这些网站的物品更新速度不会特别快，一天一次更新物品相似度矩阵对它们来说不会造成太大的损失，是可以接受的。\n",
    "\n",
    "|          | UserCF                                                                                                                                                                                                               | ItemCF                                                                                                                              |\n",
    "|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|\n",
    "| 性能     | 适用于用户较少的场合，如果用户很多，计算用户相似度矩阵代价很大                                                                                                                                                       | 适用于物品数明显小于用户数的场合，如果物品 很多(网页)，计算物品相似度矩阵代价很大                                                   |\n",
    "| 领域     | 时效性较强，用户个性化兴趣不太明显的领域                                                                                                                                                                             | 长尾物品丰富，用户个性化需求强烈的领域                                                                                              |\n",
    "| 实时性   | 用户有新行为，不一定造成推荐结果的立即变化                                                                                                                                                                           | 用户有新行为，一定会导致推荐结果的实时变化                                                                                          |\n",
    "| 冷启动   | 在新用户对很少的物品产生行为后，不能立即对他进行个性化推荐，因为用户相似度表是每隔一段时间离线计算的。<br><br>新物品上线后一段时间，一旦有用户对物品产生行为，就可以将新物品推荐给和对它产生行为的用户兴趣相似的其他用户 | 新用户只要对一个物品产生行为，就可以给他推荐和该物品相关的其他物品 <br><br>但没有办法在不离线更新物品相似度表的情况下将新物品推荐给用户 |\n",
    "| 推荐理由 | 很难提供令用户信服的推荐解释                                                                                                                                                                                         | 利用用户的历史行为给用户做推荐解释，可以令用户比较信服                                                                              |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "<a id=\"org27435e1\"></a>\n",
    "\n",
    "## 哈利玻特问题\n",
    "\n",
    "ItemCF计算物品相似度的经典公式：\n",
    "\n",
    "\\begin{equation}\n",
    "w_{ij} = \\frac{|N(i) \\cap N(j)|}{\\sqrt{|N(i)||N(j)|}} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "为解决哈利波特问题，可以在分母上加大对热门物品的惩罚，采用如下公式：\n",
    "\n",
    "\\begin{equation}\n",
    "w_{ij} = \\frac{|N(i) \\cap N(j)}{|N(i)|^{1 - \\alpha} |N(j)|^{\\alpha}} \\nonumber\n",
    "\\end{equation}\n",
    "\n",
    "其中 $\\alpha \\in [0.5,1]$ 。通过提高 $\\alpha$ ，就可以惩罚热门的 $j$ 。\n",
    "\n",
    "两个不同领域的最热门物品之间往往具有比较高的相似度。此时，仅仅靠用户行为数据是不能解决这个问题的，因为用户的行为表示这种物品之间应该相似度很高。只能依靠引入物品的内容数据解决这个问题，比如对不同领域的物品降低权重等。这些就不是协同过滤讨论的范畴了。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ein.tags": "worksheet-0",
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "## 脚注\n",
    "\n",
    "<sup><a id=\"fn.1\" class=\"footnum\" href=\"#fnr.1\">1</a></sup> 参见Linden Greg、Smith Brent和York Jeremy的“Amazon.com Recommendations: Item-to-Item Collaborative Filtering.” (IEEE Internet Computing，2003)。\n",
    "\n",
    "<sup><a id=\"fn.2\" class=\"footnum\" href=\"#fnr.2\">2</a></sup> 参见George Karypis的论文“Evaluation of Item-based Top-N Recommendation Algorithms”。\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "name": "6-item-based-collaborative-filtering.ipynb"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
